Question 1: Given some sample data, write a program to answer the following: click here to access the required data set
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis
Lets go ahead and load the given data set
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## Warning: package 'stringr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
df<-read.csv("C://Users//maila//Desktop//Test REPL//REPL_ML_Exercise//2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv", header = T)
head(df)
## order_id shop_id user_id order_amount total_items payment_method
## 1 1 53 746 224 2 cash
## 2 2 92 925 90 1 cash
## 3 3 44 861 144 1 cash
## 4 4 18 935 156 1 credit_card
## 5 5 18 883 156 1 credit_card
## 6 6 58 882 138 1 credit_card
## created_at
## 1 2017-03-13 12:36:56
## 2 2017-03-03 17:38:52
## 3 2017-03-14 4:23:56
## 4 2017-03-26 12:43:37
## 5 2017-03-01 4:35:11
## 6 2017-03-14 15:25:01
It is said that the AOV found is #3145.13. Lets try to understand where this is coming from. For now lets calculate the same by taking straight average of the sale value
It looks like the above value is a straight average value taken from the order amount.
mean(df$order_amount)
## [1] 3145.128
summary(df)
## order_id shop_id user_id order_amount
## Min. : 1 Min. : 1.00 Min. :607.0 Min. : 90
## 1st Qu.:1251 1st Qu.: 24.00 1st Qu.:775.0 1st Qu.: 163
## Median :2500 Median : 50.00 Median :849.0 Median : 284
## Mean :2500 Mean : 50.08 Mean :849.1 Mean : 3145
## 3rd Qu.:3750 3rd Qu.: 75.00 3rd Qu.:925.0 3rd Qu.: 390
## Max. :5000 Max. :100.00 Max. :999.0 Max. :704000
## total_items payment_method created_at
## Min. : 1.000 Length:5000 Length:5000
## 1st Qu.: 1.000 Class :character Class :character
## Median : 2.000 Mode :character Mode :character
## Mean : 8.787
## 3rd Qu.: 3.000
## Max. :2000.000
From above, we see that the order amount has a maximum value of 704000 which looks like an outlier. Now, lets look at what exactly is this transaction.
subset(df,df$order_amount==704000)
## order_id shop_id user_id order_amount total_items payment_method
## 16 16 42 607 704000 2000 credit_card
## 61 61 42 607 704000 2000 credit_card
## 521 521 42 607 704000 2000 credit_card
## 1105 1105 42 607 704000 2000 credit_card
## 1363 1363 42 607 704000 2000 credit_card
## 1437 1437 42 607 704000 2000 credit_card
## 1563 1563 42 607 704000 2000 credit_card
## 1603 1603 42 607 704000 2000 credit_card
## 2154 2154 42 607 704000 2000 credit_card
## 2298 2298 42 607 704000 2000 credit_card
## 2836 2836 42 607 704000 2000 credit_card
## 2970 2970 42 607 704000 2000 credit_card
## 3333 3333 42 607 704000 2000 credit_card
## 4057 4057 42 607 704000 2000 credit_card
## 4647 4647 42 607 704000 2000 credit_card
## 4869 4869 42 607 704000 2000 credit_card
## 4883 4883 42 607 704000 2000 credit_card
## created_at
## 16 2017-03-07 4:00:00
## 61 2017-03-04 4:00:00
## 521 2017-03-02 4:00:00
## 1105 2017-03-24 4:00:00
## 1363 2017-03-15 4:00:00
## 1437 2017-03-11 4:00:00
## 1563 2017-03-19 4:00:00
## 1603 2017-03-17 4:00:00
## 2154 2017-03-12 4:00:00
## 2298 2017-03-07 4:00:00
## 2836 2017-03-28 4:00:00
## 2970 2017-03-28 4:00:00
## 3333 2017-03-24 4:00:00
## 4057 2017-03-28 4:00:00
## 4647 2017-03-02 4:00:00
## 4869 2017-03-22 4:00:00
## 4883 2017-03-25 4:00:00
From above, we see that there are many transactions with an order amount 70400$ which is done using credit card by same user id 607 with a same shop id 42 and purchased same items which are 2000. This is so weird. It looks like the user is purchasing every 3 days at one particular point same items and in some days the data is duplicated especially on 2017-03-28.
Lets try to see if there are any other transactions like this in our data set. This can be found by visualising the given data set.
a=ggplot(df, aes(x=order_id, y=order_amount))+geom_point()+theme_classic()
b=ggplotly(a)
b
From above it shows that these transactions with 70400$ are the big outlier when calculating the AOV. It is caused due to the fact that the rows are duplicated and also the there is something not right with this transaction which is done every 3 days. It may be possible only if Shopify was running a sale and there is a limit in the purchase quantity per user/day or 3 days. This has lead for the buyer to accumulate stock at a cheaper price from shopify and he may be planning to sell it high post shopify sale or in his retail.
One way to look at this may be using a Median value because mean is not always reliable in this skewed data sets. When we look at median we will get an AOV of 284$ which is close to actuals.
I would report a median value for this data set if asked and highlight the transaction which looks like an outlier and clean the raw data set to prevent duplicate transactions.
The median value as said above would be 284$
median(df$order_amount)
## [1] 284
Total orders shipped via speedy express are 54
The Last Name is Peacock
3.SELECT Customers.Country,
OrderDetails.ProductID,OrderDetails.Quantity,Products.ProductID,Products.ProductName FROM Customers INNER JOIN Orders on Customers.CustomerID=Orders.CustomerID INNER JOIN OrderDetails ON Orders.OrderID=OrderDetails.OrderID INNER JOIN Products ON OrderDetails.ProductID=Products.ProductID where Country==“Germany” Group By ProductName Order By Quantity Desc Limit 1
The top selling product in Germany is “Steeleye Stout”